1 Introduction

This vignette demonstrates some numerical and visualization tools to summarize and explore missingness in a meta-analytic dataset. The code used in this vignette is implemented in the R software language and is based on the naniar and visdat libraries. Additional functions are stored in the wrappers.R library.

The purpose of this vignette is to illustrate how to conduct an exploratory missingness analysis (EMA) as described by Schauer et al. (2020). Thus, this vignette can be seen as a companion to that article. Note that several commands are written specifically for visualizing and quantifying missingness in a meta-analysis and are stored in the wrappers.R file downloadable here. However, it is important to note that while the tools presented here can be useful for many datasets, they are not exhaustive.

2 Data

This vignette uses data on substance abuse interventions for adolescents (Tanner-Smith et al., 2016). These data comprise 328 distinct effects of substance abuse interventions on future substance use. Effects arise from 46 studies, and are accompanied by information about estimated treatment effects and variances, as well as over three dozen variables describing how and where the intervention was implemented. While none of the effect estimates are missing, there is missingness on some 18 other variables of interest, including the sample composition of studies and the duration and intensity of the intervention.

Each effect is a contrast between two groups in the study, which are referred to as Group 1 and Group 2 in this vignette. Variables corresponding to Group 1 in the data have names that start with g1, while variables corresponding to Group 2 have names that start with g2.

3 Load Data and Relevant Libraries

We start by loading the necessary packages and the data:

# libraries for visualizations
library(naniar)
library(visdat)

# code to extend libraries for meta-analysis
source("wrappers.R")

# library for data wrangling and plotting
library(tidyverse)

# read in data
adt_data <- readRDS("adt_data.RDS")

Rows of the data correspond to intervention impacts—referred to in this vignette as cases—and columns refer to variables about those impacts, including the estimated effect, the standard error of that estimate, and several other variables regarding effects, several of which are listed in the table below. It is important to note that this vignette includes a greater number of variables than was used in the example EMA in the article, so some of the plots and numerical summaries will differ in this vignette.

Variable Variable Name
Study ID studyid
Timing of Follow-Up estimingpost
Effect Size es_g
Standard Error se_g
Design design
State state
Group 1 Manualized g1man
Group 1 Monitored g1mon
Group 1 Treatment Category g1txcat
Group 1 Treatment Location g1loc
Group 1 Hrs Per Week g1hrsperweek
Group 1 Treatment Days g1txdays
Group 1 No. Session g1numsessions
Group 1 Mean Age g1age
Group 1 Pct White g1perwhite
Group 1 Pct Black g1perblack
Group 1 Pct Hispanic g1perhisp
Group 1 Pct Male g1permale
Group 2 Manualized g2man
Group 2 Monitored g2mon
Group 2 Treatment Category g2txcat
Group 2 Treatment Location g2loc
Group 2 Hrs Per Week g2hrsperweek
Group 2 Treatment Days g2txdays
Group 2 No. Session g2numsessions
Group 2 Mean Age g2age
Group 2 Pct White g2perwhite
Group 2 Pct Black g2perblack
Group 2 Pct Hispanic g2perhisp
Group 2 Pct Male g2permale



4 Summary Visualizations with visdat

The visdat library visualizes the entire data frame at once (Tierney, 2017). The plots below are laid out in tiles that mirror the structure of the data: Rows correspond to cases and columns correspond to variables concerning each case. Missing cells are indicated by black tiles.

# summary plots using vis_dat and vis_miss
gg_summary_covariate_miss(adt_data)


The left plot presents a summary of the dataset structure while highlighting variables with missing data; we can initially identify 18 variables of interest with some missingness. Further, we get a better idea of how much data is missing by looking at the second plot (6.2%).

5 Summarizing Missingness with naniar

The naniar package also provides an approach to visualizing overall missingness of the dataset (Tierney & Cook, 2018).

We can plot the number of cases missing for each variable. In the plot below, variable names appear along the \(y\)-axis, and lines indicate the number of cases for which each variable is missing.

# summary of missing variables
gg_miss_var(adt_data) +
  labs(y="Missing Cases", 
       title = "Total Missingness by Variable") +
  theme(axis.title.x = element_text(size =10),
        axis.text.y = element_text(size=6))



Missingness can also be shown as proportions for each variable:

# summary of missing variables (percentage)
gg_miss_var(adt_data, show_pct=TRUE) +
  labs(y="Percentage Missing", 
       title = "Percent Missingness by Variable") +
  theme(axis.title.x = element_text(size =10),
        axis.text.y = element_text(size=6))



We can also look at the number of missing values in each case. In the plot below, the \(y\)-axis refers to each case in the data, and the \(x\)-axis indicates the number of variables that cases are missing.

# number of missing values in each case
gg_miss_case(adt_data, order_cases=TRUE) + 
  labs(x="Number of Cases",
       title = "Total Missingness by Case")



Other summary plots show cumulative counts of cells in the data that are missing data on the \(y\)-axis. The plot below shows the cumulative count of missing fields as a function of case number (i.e., by reading the data from the top row to the bottom).

#cumulative sum of missing values
gg_miss_case_cumsum(adt_data) +
  labs(title = "Cumulative Missingness by Case")



A similiar plot (below) shows the cumulative count of missing cells (\(y\)-axis) as a function of the variables in the data (i.e., by reading the data from left to right).

6 Explore Missingness Across Categorical Variables

It will often be of interest whether missingness is correlated with the value of observed variables. The following plots show the number of missing cases versus the level of care given to Group 1 (top figure) and Group 2 (bottom figure) in the data. Each plot in the figures corresponds to a level of care. The \(y\)-axis of the plots refer to the cases, and the \(x\)-axis shows how many variables are missing for each case. In the plot for Group 1, we see that cases where Group 1 received inpatient care are seldom missing other variables.

# level of care group 1
gg_miss_case(adt_data, facet=g1loc) +
  labs(title = "Missingess vs. Group 1 Level of Care")

# level of care group 2
gg_miss_case(adt_data, facet=g2loc) +
  labs(title = "Missingess vs. Group 2 Level of Care")



An alternative visualization shows the percentage of rows missing data as a function of the levels of a given variable. In the plot below, we see missingness in other variables as a function of Group 2’s level of care. The plot is laid out in tiles with rows corresponding to variables and columns corresponding to the level of care received by Group 2. Cells are shaded according to the amount of missingness for each variable.

# level of care group 2
gg_miss_fct(x=adt_data, fct=g2loc) +
  labs(x="Group 2 Level of Care") +
  theme(axis.text.y = element_text(size=7)) +
  labs(title = "Missingess vs. Group 2 Level of Care")



7 Missingness Patterns

A key tool in examining missingness patterns is the upset plot. The upset plot reveals which combinations of variables are often missing from the same cases. The plot below has two panes, one on top, and one on the bottom. The bottom pane shows a collection of variables and the dots indicate that a set of variables are missing from the same cases. The top pane shows how frequently each pattern of missingness occurs.

gg_miss_upset(adt_data, nsets=11, nintersects=NA)

Another approach is to look at pairwise correlations of missingness indicators. An easy way to do this is with a correlation heatmap. In R, the the ggcorrplot() function from the ggcorrplot library is useful for visualizing correlation matrices:

library(ggcorrplot)
library(viridis)

# Set up a data frame with missingness indicators
adt_mc <- adt_data %>%
  select_if(any_na) %>%
  mutate_all(.funs = function(x) ifelse(is.na(x), 1, 0))

# Get pairwise correlations of missingness
miscor <- cor(adt_mc) 

# Visualize the correlation matrix
ggcorrplot(miscor, type = "lower") +
  viridis::scale_fill_viridis("Correlation of\nMissingness")



8 Missingness and Effect Size Estimates

A key relationship to examine is the correlation of missingness with observed effect sizes or standard errors of effect size estimates. There are several visualizations that can help unpack these relationships.

8.1 Density Plots

A useful object for studying this is a shadow data frame that includes columns that indicate whether a variable is missing. These columns, which contain a suffix _NA to variable names, can be thought of as the response indicator matrix described in the article.

# build shadow data frame
adt_shadow <- bind_shadow(adt_data)

Using the shadow data frame, the following code, implemented in the wrappers.R script visualizes potential relationships between missingness and effect size estimates. Each figure contains a pair of plots, and each plot shows the density of the effect sizes or standard errors colored according to whether a specified covariate is missing.

# missingness in group 1 hours per week
gg_esse_covariate_miss(adt_shadow,
                       es_col = "es_g",
                       se_col = "se_g",
                       covariate = "g1hrsperweek",
                       adjust = c(1.3, 1.2), # Adjust smoothing for ES and SE densitites
                       label = "Group 1 Hrs Per Week")

# missingness in group 1 duration of treatment 
gg_esse_covariate_miss(adt_shadow,
                       es_col = "es_g",
                       se_col = "se_g",
                       covariate = "g1txdays",
                       adjust = c(1.3, 1.2), 
                       label = "Group 1 Duration of Treatment (Days)")

# missingness in group 2 hours per week
gg_esse_covariate_miss(adt_shadow,
                       es_col = "es_g",
                       se_col = "se_g",
                       covariate = "g2hrsperweek",
                       adjust = c(1.3, 1.2), 
                       label = "Group 2 Hrs Per Week")



8.2 Forest Plots

An alternative to the above plots is to build more traditional forest plots, common in most meta-analyses, that shade effects according to whether a covariate is missing. The plot below shows a forest plot where effects are shaded by whether Group 1’s treatment intensity (in hours per week) is missing. Effects are arranged according to missingness of the treatment intensity, and by their standard errors in the plot below. Note that this only includes a subset of the the data to better demonstrate these plots.

# forest plot
gg_forest_covariate_miss(
    adt_shadow %>% sample_frac(.5), # subset of the data
    es_col = "es_g",
    se_col = "se_g",
    covariate = "g1hrsperweek",
    label = "Group 1 Hrs \nPer Week",
    arrange_by = "SE"
    ) +
labs(title = "Forest Plot vs. Missing Group 1 Treatment Intensity")

8.3 Scatterplots

Finally, it may be helpful to view the distribution of missingness in a potential moderator of effects (i.e., a covariate) relative to the relationshpi between that moderator and effects. In the plot below, effect estimates (\(y\)-axis) are plotted against Group 1’s treatment intensity (in hours per week) on the \(x\)-axis. Effects that are missing the treatment intensity variable are shown on the left side of the plot. Each point corresponds to a single effect estimate. In the second plot below, those points are sized according to the standard error of each effect estimate.

# effect size vs group 1 hours per week
ggplot(adt_data) +
  aes(x = g1hrsperweek,
      y = es_g) +
  geom_miss_point(alpha = .8) +
  scale_color_manual("Group 1 Hrs Per Week", 
                     values = c("#B9DE52", "#46095D")) +
  labs(title = "Effect Size vs. Group 1 Treatment Intensity")

# effect size vs group 1 hours per week
# points sized by SE
pointsize <- (1/adt_data$se_g^2) / mean(1/adt_data$se_g^2) * 2
ggplot(adt_data) +
  aes(x = g1hrsperweek,
      y = es_g) +
  geom_miss_point(alpha = .5, 
                  size = pointsize) +
  scale_color_manual("Group 1 Hrs Per Week", 
                     values = c("#B9DE52", "#46095D")) +
  labs(title = "Effect Size vs. Group 1 Treatment Intensity")



9 Numerical Summaries

Finally, there are various numerical summaries of missingness. There are a series of statistics, including the number and proportion of cells in the data that are missing values.

# total number of missing cells
miss_cells <- n_miss(adt_data) 
paste("Number of missing cells is", miss_cells)
## [1] "Number of missing cells is 936"
# total number of complete cells
comp_cells <- n_complete(adt_data)
paste("Number of complete cells is", comp_cells)
## [1] "Number of complete cells is 14152"
# proportion of missing cells
pct_miss_cells <- prop_miss(adt_data)
paste0("The % of missing cells is ", 
      100 * round(pct_miss_cells, 3), 
      "%")
## [1] "The % of missing cells is 6.2%"
# proportion of complete cells
pct_comp_cells <- prop_complete(adt_data)
paste0("The % of complete cells is ", 
      100 * round(pct_comp_cells, 3), 
      "%")
## [1] "The % of complete cells is 93.8%"



It is also useful to compute missingness by variable. The following table shows the number and proportion of cases missing a given variable.

#Percentage missing by variable
miss_var_summary(adt_data) %>% 
  kable(digits = 2) %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
variable n_miss pct_miss
g2hrsperweek 152 46.34
g2txdays 117 35.67
g2numsessions 111 33.84
g2loc 109 33.23
g1hrsperweek 79 24.09
g2perblack 75 22.87
g1perblack 71 21.65
g1perhisp 56 17.07
g2perhisp 51 15.55
g1numsessions 35 10.67
g2perwhite 18 5.49
g2age 14 4.27
g2pernonwhite 13 3.96
g1perwhite 11 3.35
g1permale 9 2.74
g1age 7 2.13
g1pernonwhite 6 1.83
g1txdays 2 0.61
pk_es 0 0.00
studyid 0 0.00
groupid1 0 0.00
groupid2 0 0.00
varid 0 0.00
estimingpost 0 0.00
es_g 0 0.00
se_g 0 0.00
authoryear 0 0.00
state 0 0.00
_referencesummary 0 0.00
design 0 0.00
g1txcat 0 0.00
g1loc 0 0.00
g1attrition 0 0.00
g1mon 0 0.00
g1imp 0 0.00
g1man 0 0.00
g2txcat 0 0.00
g2attrition 0 0.00
g2permale 0 0.00
g2mon 0 0.00
g2imp 0 0.00
g2man 0 0.00
dvmicro 0 0.00
dvsucat 0 0.00
estxobnpost 0 0.00
esctobnpost 0 0.00



The following table, implemented in wrappers.R computes the precision-weighted percentage of missingness in addition to the raw count and proportion of missingness for each variable:

# same thing, but with a wrapper function for weighted percents
mis_ma_var_summary(adt_data, "se_g", truncate = TRUE) %>% 
  kable(digits = 2) %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
Variable n_miss pct_miss wtpct_miss
g2hrsperweek 152 46.34 45.20
g1hrsperweek 79 24.09 36.76
g2txdays 117 35.67 32.63
g2numsessions 111 33.84 30.17
g2loc 109 33.23 29.97
g2perblack 75 22.87 17.51
g1perblack 71 21.65 16.06
g1perhisp 56 17.07 12.71
g2perhisp 51 15.55 12.45
g1numsessions 35 10.67 6.93
g2perwhite 18 5.49 4.57
g1perwhite 11 3.35 3.05
g2age 14 4.27 2.97
g2pernonwhite 13 3.96 2.33
g1age 7 2.13 1.45
g1permale 9 2.74 1.34
g1pernonwhite 6 1.83 0.82
g1txdays 2 0.61 0.10



Finally, we can tabulate the total number and percentage of variables that are missing for each case:

# percentage missing by case (Study)
miss_case_summary(adt_data) %>%
  kable(digits = 2) %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
case n_miss pct_miss
15 17 36.96
16 17 36.96
199 12 26.09
266 10 21.74
267 10 21.74
268 10 21.74
269 10 21.74
270 10 21.74
271 10 21.74
272 10 21.74
178 9 19.57
153 8 17.39
154 8 17.39
156 8 17.39
157 8 17.39
159 8 17.39
160 8 17.39
162 8 17.39
163 8 17.39
165 8 17.39
166 8 17.39
168 8 17.39
169 8 17.39
171 8 17.39
173 8 17.39
174 8 17.39
176 8 17.39
177 8 17.39
179 7 15.22
180 7 15.22
181 7 15.22
182 7 15.22
183 7 15.22
184 7 15.22
185 7 15.22
78 6 13.04
80 6 13.04
5 5 10.87
6 5 10.87
7 5 10.87
8 5 10.87
9 5 10.87
35 5 10.87
36 5 10.87
37 5 10.87
38 5 10.87
39 5 10.87
40 5 10.87
41 5 10.87
42 5 10.87
43 5 10.87
44 5 10.87
45 5 10.87
46 5 10.87
47 5 10.87
48 5 10.87
49 5 10.87
50 5 10.87
51 5 10.87
52 5 10.87
53 5 10.87
54 5 10.87
55 5 10.87
56 5 10.87
57 5 10.87
58 5 10.87
59 5 10.87
117 5 10.87
118 5 10.87
119 5 10.87
123 5 10.87
124 5 10.87
125 5 10.87
143 5 10.87
144 5 10.87
186 5 10.87
187 5 10.87
10 4 8.70
11 4 8.70
12 4 8.70
13 4 8.70
14 4 8.70
63 4 8.70
64 4 8.70
65 4 8.70
66 4 8.70
106 4 8.70
107 4 8.70
108 4 8.70
129 4 8.70
130 4 8.70
131 4 8.70
132 4 8.70
133 4 8.70
145 4 8.70
146 4 8.70
147 4 8.70
148 4 8.70
152 4 8.70
155 4 8.70
158 4 8.70
161 4 8.70
164 4 8.70
167 4 8.70
170 4 8.70
172 4 8.70
175 4 8.70
260 4 8.70
261 4 8.70
262 4 8.70
263 4 8.70
264 4 8.70
265 4 8.70
275 4 8.70
276 4 8.70
278 4 8.70
279 4 8.70
281 4 8.70
282 4 8.70
284 4 8.70
285 4 8.70
287 4 8.70
288 4 8.70
290 4 8.70
291 4 8.70
293 4 8.70
294 4 8.70
296 4 8.70
297 4 8.70
299 4 8.70
300 4 8.70
302 4 8.70
303 4 8.70
305 4 8.70
306 4 8.70
308 4 8.70
309 4 8.70
1 2 4.35
2 2 4.35
3 2 4.35
4 2 4.35
17 2 4.35
18 2 4.35
19 2 4.35
20 2 4.35
21 2 4.35
22 2 4.35
23 2 4.35
24 2 4.35
25 2 4.35
26 2 4.35
27 2 4.35
28 2 4.35
29 2 4.35
30 2 4.35
31 2 4.35
32 2 4.35
33 2 4.35
34 2 4.35
70 2 4.35
71 2 4.35
72 2 4.35
73 2 4.35
74 2 4.35
75 2 4.35
76 2 4.35
77 2 4.35
79 2 4.35
98 2 4.35
99 2 4.35
100 2 4.35
101 2 4.35
102 2 4.35
103 2 4.35
104 2 4.35
105 2 4.35
120 2 4.35
121 2 4.35
122 2 4.35
126 2 4.35
127 2 4.35
128 2 4.35
194 2 4.35
195 2 4.35
196 2 4.35
200 2 4.35
203 2 4.35
209 2 4.35
215 2 4.35
221 2 4.35
226 2 4.35
232 2 4.35
238 2 4.35
239 2 4.35
240 2 4.35
241 2 4.35
242 2 4.35
243 2 4.35
244 2 4.35
245 2 4.35
246 2 4.35
247 2 4.35
248 2 4.35
249 2 4.35
311 2 4.35
317 2 4.35
201 1 2.17
204 1 2.17
205 1 2.17
206 1 2.17
207 1 2.17
210 1 2.17
211 1 2.17
212 1 2.17
213 1 2.17
216 1 2.17
217 1 2.17
218 1 2.17
219 1 2.17
222 1 2.17
223 1 2.17
224 1 2.17
225 1 2.17
227 1 2.17
228 1 2.17
229 1 2.17
230 1 2.17
233 1 2.17
234 1 2.17
235 1 2.17
236 1 2.17
312 1 2.17
313 1 2.17
314 1 2.17
315 1 2.17
318 1 2.17
319 1 2.17
320 1 2.17
321 1 2.17
322 1 2.17
323 1 2.17
324 1 2.17
60 0 0.00
61 0 0.00
62 0 0.00
67 0 0.00
68 0 0.00
69 0 0.00
81 0 0.00
82 0 0.00
83 0 0.00
84 0 0.00
85 0 0.00
86 0 0.00
87 0 0.00
88 0 0.00
89 0 0.00
90 0 0.00
91 0 0.00
92 0 0.00
93 0 0.00
94 0 0.00
95 0 0.00
96 0 0.00
97 0 0.00
109 0 0.00
110 0 0.00
111 0 0.00
112 0 0.00
113 0 0.00
114 0 0.00
115 0 0.00
116 0 0.00
134 0 0.00
135 0 0.00
136 0 0.00
137 0 0.00
138 0 0.00
139 0 0.00
140 0 0.00
141 0 0.00
142 0 0.00
149 0 0.00
150 0 0.00
151 0 0.00
188 0 0.00
189 0 0.00
190 0 0.00
191 0 0.00
192 0 0.00
193 0 0.00
197 0 0.00
198 0 0.00
202 0 0.00
208 0 0.00
214 0 0.00
220 0 0.00
231 0 0.00
237 0 0.00
250 0 0.00
251 0 0.00
252 0 0.00
253 0 0.00
254 0 0.00
255 0 0.00
256 0 0.00
257 0 0.00
258 0 0.00
259 0 0.00
273 0 0.00
274 0 0.00
277 0 0.00
280 0 0.00
283 0 0.00
286 0 0.00
289 0 0.00
292 0 0.00
295 0 0.00
298 0 0.00
301 0 0.00
304 0 0.00
307 0 0.00
310 0 0.00
316 0 0.00
325 0 0.00
326 0 0.00
327 0 0.00
328 0 0.00



References

Tanner-Smith EE, Steinka-Fry KT, Kettrey HH, Lipsey MW. (2016) Adolescent substance use treatment effectiveness: A systematic review and meta-analysis. Office of Justice Programs.
Tierney NJ. (2017) visdat: Visualising whole data frames. JOSS 2: 355.
Tierney NJ, Cook DH. (2018) Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. arXiv:180902264 [stat].